Description

Background & Context

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas

The purpose of this study is to help Thera bank by coming up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards

Data Dictionary:

Define functions needed for data analysis

Read the dataset

View the head and random sample of the data set

The CLIENTNUM column looks like a client identifier and can be dropped. The income category has various ranges and is a mix of numbers and characters. Also one of the values is abc, this can be looked into more closely.

Check the data types of the columns for the dataset.

Education_level and Marital_Status have nulls. Most of the columns are numeric except for Attrition_Flag, Gender, Education_level, Marital_Status, Income_Category, and Card_Category, these columns are of object datatype. We can look into value counts to see if some of the columns can be made categorical. Attrition_Flag is our target column and it is defined as object type have to look into this further.

Feature Engineering

Looking at unique values for data

  1. The CLIENTNUM has 10,127 unique rows which clearly shows that its an identifier and can be dropped.
  2. Attrition_Flag is a boolean. Also it is of object data type which we have to fix as this is the target variable.
  3. Gender has 2 values for Male and Female
  4. Dependent_count looks categorical
  5. Education_Level also has only 6 values and could be categorical
  6. Marital Status has 3 values and looks categorical.
  7. Income_Category also has only 6 values and could be categorical
  8. Card_Category only has 4 vlaues and could be categorical
  9. Total_Relationship_Count only has 6 categories
  10. Months_Inactive_12_mon has 7 unique values
  11. Contacts_Count_12_mon also has 7 unique values
  1. Attrition_Flag is defined as object but its our target variable so this is updated to values of 0 and 1. Also it shows an imbalance in our target column. Most of the customers are Existing customers and very few are Attrited Customers.
  2. Gender had 2 unique values M and F. There are slightly more Females, this can be made categorical.
  3. Dependent_Count shows the number of dependednts, it ranges from 0 to 5, this can be made categorical.
  4. Marital_Status has 3 categories , Married, Single and Divorced. It can be made categorical.
  5. Income_Category has 6 categories, one of the categories is abc which does not look correct.
  6. Card_Category has 4 categories, Blue, Gold, Silver and Platinum. Most of the customers have Blue credit card. This can be made into a category.
  7. Total_Relationship_Count has 6 categories and can be made categorical.

There are 1,112 rows where the Income_Category is abc. This looks like a data entry error and will be updated with 'unknown'

Summary of the data

  1. For Customer_Age the mean and median are very close with 46.3 and 46 respectively. The maximum age is 73 which is above 75th percentile but it seems normal to have the maximum age of 73.
  2. Months_on_book has outliers on both ends, the max value is 56 which is above the 75th percentile and its min value is 13 which is well below the 25th percentile.
  3. Credit_Limit has a wide range, it also has outliers on both sides.The max credit limit is 34,516.
  4. Total_Revolving_Bal, Avg_Open_To_Buy, Total_Trans_Amt, Total_Trans_Ct also have outliers on both ends.

EDA

Univariate Analysis

Creating histogram and barplot for every numerical field

  1. Attrition_Falg is a boolean and we can see that there are more Existing Customers than Attrited customers
  2. Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months). This has many upper outliers, but this could be reasonable since there's a wide range for credit card limit too. Also this is right skewed. We can look at it in more detail.
  3. Avg_Utilization_Ratio: Represents how much of the available credit the customer spent. This column does not have any outliers and the distribution is right skewed. Most of the customers have low utilization ratio
  4. Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months. This has outliers on both ends, but its understandable that some customers have not been contacted at all, while others have been contacted more than 4 times. We can look at it in more detail.
  5. Credit_Limit also has outliers, but only upper outliers. it is right skewed. We can look at outliers in more detail.
  6. Customer_Age is mostly a normal distribution, it has a few upper outliers. The maximum age is above 73 which is understandable and possible.
  7. Months_Inactive_12_mon: No. of months inactive in the last 12 months. This has 7 values and mostly it is inactive for 3 months, then 2 followed by one month. Being in active for more than 4 months is considered as an outlier.
  8. Months_on_book: Period of relationship with the bank. we can see that most customers have a relationship of around 36 months. There are a few lower oultiers past 25 and few upper outliers past 55.
  9. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter. This has normal distribution but has many outliers. we can treat these outliers.
  10. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter. This is normally distributed but it has upper and lower outliers. we can treat these outliers.
  11. Total_Revolving_Bal is slightly left skewed and does not have any outliers
  12. Total_Trans_Amt is more like a wave, it has many upper outliers. we can treat these outliers.
  13. Total_Trans_Ct is normally distributed and has a few upper outliers. we can treat these outliers.

There are 835 rows where the Avg_OpenTo_Buy is more than 25, 000, but this looks normal as the credit limit is also more than 25, 000. Both Credit Limit and Avg_Open_to_Buy have outliers as there is a wide range of credit limits that customers can have. The max credit limit is 34,516 and the max avg_open_to_buy is also 34, 516. Since there are only 835 rows we will treat the outliers by capping

It looks like 64% of the customers were contacted 2 or 3 times. Around 14% of the customers were contacted 1 or 4 times. only 4% of the customers were contacted 0 times. And less than 2% of the customers were contacted 5 times and less than half percent were contacted 6 times. The outliers are very minimal so we will go ahead and treat the outliers lower and upper outliers

There are 892 rows where the Credit_Limit is more than 25000. The maximum credit_Limit is 34516. This is understandable as different customers will have different credit limits. Since there are only 892 rows we can treat the outliers.

Creating barplot for categorical or object data types

  1. There are slightly more females 52.9% than males at 47.1%
  2. Dependednt_Count has 6 categories. Most customers have 2 or 3 dependents at around 26% each. Then 18% of the customers have 1 dependednt. 15.5% of the customers have 4 dependents. Around 9% have 0 dependents and 4.2% have 5 dependents.
  3. There are 6 education levels. the most common is Graduate with 31% of the customers. Then theres high school at 20%, Followed by uneducated at 15%, then College at 10%. Doctorate and Post Graduate are at around 5%.
  4. The most common marital_status is married at 46.3%, then Single at 38.9%. And only 7.4% are divorced.
  5. There are 6 Income categories. Most of the customers 35.2%, earn less than 40k. 17.7% earn between 40k to 60k. 15% earn 80k to 120k. 14% earn 60k to 80 k. 11% of the customers have unknown Income_Category. and only 7.2% of the customers have more than 120k.
  6. There are 4 card categories. The most common is Blue with 93.2% of the customers having this category, then there's Silver with only 5.5%, followed by Gold at 1.1% and Platinum at 0.2%
  7. Total_Relationship_Count is the number of products held by the customer. Most customers had three products at 22.8% then customers had 4, 5 and 6 products at 18%. 12.3% of the customers had only 2 products and 9% had only one product.

Bivariate analysis

  1. This shows a strong corelation between Customer_Age and Months_on_book. This is reasonable as the the older customers will have more number of months with the bank. We can drop Months_on_book as it is highly corelated with another independednt variable.
  2. There is a strong corelation with total_Trans_Amt and total_trans_ct. We can keep Total_Trans_Ct as it is more corelated with the Attrition_Flag than Total_Trans_Amt. We will drop Total_Trans_Amt as it is highly corelated.
  3. Avg_Utilization_Ratio is negatively related to Credit Lmimit and Avg_Open_to_buy
  4. There's a positive relationship between Avg_Open_to_buy and Credit_Limit
  5. Avg_utilization_Ratio is positively related with Toal_revolving_Bal.
  6. Attrition_Flag is slightly negatively corealted with Months_Inactive_12_mon and Contacts_Count_12_mon.
  7. Attrition_Flag is also slightly related with Total_TransCt and Total_Ct_Chng_q4_q1
  8. Total_Amt_Chng_Q4_Q1 is slightly corealted with Total_Ct_Chng_Q4_Q1

Lets look at categorical columns with Attrition_Flag

  1. The attrition between male and female is almost the same.
  2. Attrition for different dependents is almost the same.
  3. Attrition for customers with Doctorate and Post-Grduate degree is slightly more than other education categories.
  4. Attrition is same across Marital_status
  5. Attrition is slightly more for Customers with Incomes greater than 120 and income less than 40K as compared to incomes between 40k-60k, or 60 -80k.
  6. Customers with Platinum card have the most attrition , followed by Gold and Blue. Silver has the least attrition
  7. Customers with Total_Relationship_Count of 1 or 2 have the most attrition, then its customers with total_relationship_Count of 3. Customers with total_Relationship_Count of 4, 5 or 6 have the least attrition.

Attrition_Flag vs Age

There is a small spike of attrition where the customer age is 30, then there are 2 more strong spikes at age above 65 and less than 70.

Attrition_Flag vs Total_Trans_Ct

Customers who attrited have a very low total_trans_ct as compared to existing customers.

Attrition_Flag vs Credit_Limit

Customers who attrited had a lesser credit limit as compared to existing customers. The data is skewed for both classes.

Attrition_Flag vs Months_Inactive_12_mon

Customers who attrited had accounts that are inactive for 12 months than Existing Customers

Grouping data w.r.t to packages to build customer profiles

Customer Profile for customers who opt for Platinum Card is as follows:

  1. Average Age of customers is 47.
  2. Average Credit_Limit is 32,045
  3. Average Total_Revolving_balance is 1602
  4. Average Open to buy is 30,442
  5. Average Total_Trans Ct is 95.9
  6. Average Utilization_Ratio is 0.055
  7. Top Gender is Male
  8. Top Education is Graduate
  9. Top Marital_Status is Single
  10. Top Income_Cateogry is 120k+
  11. Top Dependent_Count is 3

Customer Profile for customers who opt for Gold Card is as follows:

  1. Average Age of customers is 45.7.
  2. Average Credit_Limit is 28,093
  3. Average Total_Revolving_balance is 1414
  4. Average Open to buy is 26678
  5. Average Total_Trans Ct is 86.3
  6. Average Utilization_Ratio is 0.06
  7. Top Gender is Male
  8. Top Education is Graduate
  9. Top Marital_Status is Single
  10. Top Income_Cateogry is 60k to 80k
  11. Top Dependent_Count is 3

Customer Profile for customers who opt for Blue Card is as follows:

  1. Average Age of customers is 46.3.
  2. Average Credit_Limit is 7,469
  3. Average Total_Revolving_balance is 1252
  4. Average Open to buy is 6217
  5. Average Total_Trans Ct is 67.8
  6. Average Utilization_Ratio is 0.31
  7. Top Gender is Female
  8. Top Education is Graduate
  9. Top Marital_Status is Married
  10. Top Income_Cateogry is Less than 40k
  11. Top Dependent_Count is 2

Customer Profile for customers who opt for Silver Card is as follows:

  1. Average Age of customers is 45.4.
  2. Average Credit_Limit is 25,519
  3. Average Total_Revolving_balance is 1297
  4. Average Open to buy is 23861
  5. Average Total_Trans Ct is 78.4
  6. Average Utilization_Ratio is 0.06
  7. Top Gender is male
  8. Top Education is Graduate
  9. Top Marital_Status is Single
  10. Top Income_Cateogry is Less than 40k
  11. Top Dependent_Count is 3

These profiles can act as a preliminary step to categorize customers for different packages and based on these profiles:

Outlier Treatment

The outliers have been treated

Summary of EDA and Data cleansing

  1. There are 10,127 rows and 21 columns
  2. There are 0 duplicate rows.
  3. Education_level and Marital_Status have nulls
  4. Attrition_Flag is our target column and is initially defined as object data type. 'Existing_Customer' was replaced with 1 and Attrited_Customer replaced with 0. This column was made numeric
  5. Gender, Dependent_count, Marital_Status, Income_Category, Card_Category, and Total_Relationship_Count were made categorical.
  6. The CLIENTNUM has 10,127 unique rows which clearly shows that its an identifier and was dropped.
  7. Avg_Open_To_Buy,Contacts_Count_12_mon,Credit_Limit,Customer_Age,Months_Inactive_12_mon,Total_Amt_Chng_Q4_Q1, Total_Ct_Chng_Q4_Q1,Total_Trans_Ct were treated for outliers
  8. Income_Category has 6 categories, one of the categories was abc which does not look correct. This was updated to 'unknown'
  9. Dropped Months_on_book as it is highly corelated with Customer_Age, another independent variable.
  10. Dropped Total_Trans_Amt as it is highly corelated with total_Trans_Ct

Summary of Univariate analysis for categorical columns

  1. There are slightly more females 52.9% than males at 47.1%
  2. Dependednt_Count has 6 categories. Most customers have 2 or 3 dependents at around 26% each. Then 18% of the customers have 1 dependednt. 15.5% of the customers have 4 dependents. Around 9% have 0 dependents and 4.2% have 5 dependents.
  3. There are 6 education levels. the most common is Graduate with 31% of the customers. Then theres high school at 20%, Followed by uneducated at 15%, then College at 10%. Doctorate and Post Graduate are at around 5%.
  4. The most common marital_status is married at 46.3%, then Single at 38.9%. And only 7.4% are divorced.
  5. There are 6 Income categories. Most of the customers 35.2%, earn less than 40k. 17.7% earn between 40k to 60k. 15% earn 80k to 120k. 14% earn 60k to 80 k. 11% of the customers have unknown Income_Category. and only 7.2% of the customers have more than 120k.
  6. There are 4 card categories. The most common is Blue with 93.2% of the customers having this category, then there's Silver with only 5.5%, followed by Gold at 1.1% and Platinum at 0.2%
  7. Total_Relationship_Count is the number of products held by the customer. Most customers had three products at 22.8% then customers had 4, 5 and 6 products at 18%. 12.3% of the customers had only 2 products and 9% had only one product.

Summary of Univariate analysis for numeric columns

  1. For Customer_Age the mean and median are very close with 46.3 and 46 respectively. The maximum age is 73 which is above 75th percentile but it seems normal to have the maximum age of 73.
  2. Months_on_book has outliers on both ends, the max value is 56 which is above the 75th percentile and its min value is 13 which is well below the 25th percentile.
  3. Credit_Limit has a wide range, it also has outliers on both sides.The max credit limit is 34,516.
  4. Total_Revolving_Bal, Avg_Open_To_Buy, Total_Trans_Amt, Total_Trans_Ct also have outliers on both ends.
  5. Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months). This has many upper outliers, but this could be reasonable since there's a wide range for credit card limit too. Also this is right skewed. We can look at it in more detail.
  6. Avg_Utilization_Ratio: Represents how much of the available credit the customer spent. This column does not have any outliers and the distribution is right skewed. Most of the customers have low utilization ratio
  7. Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months. This has outliers on both ends, but its understandable that some customers have not been contacted at all, while others have been contacted more than 4 times. We can look at it in more detail.
  8. Credit_Limit also has outliers, but only upper outliers. it is right skewed. We can look at outliers in more detail.
  9. Customer_Age is mostly a normal distribution, it has a few upper outliers. The maximum age is above 73 which is understandable and possible.
  10. Months_Inactive_12_mon: No. of months inactive in the last 12 months. This has 7 values and mostly it is inactive for 3 months, then 2 followed by one month. Being in active for more than 4 months is considered as an outlier.
  11. Months_on_book: Period of relationship with the bank. we can see that most customers have a relationship of around 36 months. There are a few lower oultiers past 25 and few upper outliers past 55.
  12. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter. This has normal distribution but has many outliers. we can treat these outliers.
  13. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter. This is normally distributed but it has upper and lower outliers. we can treat these outliers.
  14. Total_Revolving_Bal is slightly left skewed and does not have any outliers
  15. Total_Trans_Amt is more like a wave, it has many upper outliers. we can treat these outliers.
  16. Total_Trans_Ct is normally distributed and has a few upper outliers. we can treat these outliers.

Summary of bivariate analysis

  1. There's strong corelation between Customer_Age and Months_on_book. This is reasonable as the the older customers will have more number of months with the bank. We can drop Months_on_book as it is highly corelated with another independent variable.
  2. There is a strong corelation with total_Trans_Amt and total_trans_ct. We can keep Total_Trans_Ct as it is more corelated with the Attrition_Flag than Total_Trans_Amt. We will drop Total_Trans_Amt as it is highly corelated.
  3. Avg_Utilization_Ratio is negatively related to Credit Limit and Avg_Open_to_buy
  4. There's a positive relationship between Avg_Open_to_buy and Credit_Limit
  5. Avg_utilization_Ratio is positively related with Toal_revolving_Bal.
  6. Attrition_Flag is slightly negatively corealted with Months_Inactive_12_mon and Contacts_Count_12_mon.
  7. Attrition_Flag is also slightly related with Total_TransCt and Total_Ct_Chng_q4_q1
  8. Total_Amt_Chng_Q4_Q1 is slightly corelated with Total_Ct_Chng_Q4_Q1
  9. The attrition between male and female is almost the same.
  10. Attrition for different dependents is almost the same.
  11. Attrition for customers with Doctorate and Post-Grduate degree is slightly more than other education categories.
  12. Attrition is same across Marital_status
  13. Attrition is slightly more for Customers with Incomes greater than 120 and income less than 40K as compared to incomes between 40k-60k, or 60 -80k.
  14. Customers with Platinum card have the most attrition , followed by Gold and Blue. Silver has the least attrition
  15. Customers with Total_Relationship_Count of 1 or 2 have the most attrition, then its customers with total_relationship_Count of 3. Customers with total_Relationship_Count of 4, 5 or 6 have the least attrition.
  16. Customers who attrited had a lesser credit limit as compared to existing customers.
  17. Customers who attrited have a very low total_trans_ct as compared to existing customers.
  18. There is a small spike of attrition where the customer age is 30, then there are 2 more strong spikes at age above 65 and less than 70.
  19. Customers who attrited had accounts that are inactive for 12 months (Months_Inactive_12_mon) than Existing Customers

Model evaluation criterion

The model can make wrong predictions as:

Which case is more important?

Predicting a customer will not attrite, and the customer does attrite.

Which metric to optimize?

We would want Recall to be maximized, the greater the Recall higher the chances of minimizing false negatives.

Model Building

Splitting data set into train, validation and test

Missing Value Treatment

Checking columns for nulls

To prevent Data Leakage, we will treat the missing values after splitting data set

Education_Level and Marital_Status have nulls.

Oversampling data using SMOTE

Undersampling train data using Random Under Sampler

Hyperparameter tuning GBM

Checking model performance

Hyperparameter tuning did help to improve the Recall score, the recall score for the training data is 0.995. Lets check validation data to confirm that the model is not overfitting

Hyperparameter tuning for Random Forest

Checking model performance

Hyperparameter tuning for XGBoost

Checking model performance

Recall is 1 so the model is overfitting

Comparing all models

XGBoost tuned gave the recall score of 1 but it is over fitting. Then GBM Tuned had a good recall at 0.995 followed by Random Forest at 0.993. All the models might be overfitting, we can check with validation data.

Calculating different metrics on validation set

Validation Performance for GBM tuned

Validation Performance for Random Forest tuned

Validation Performance for XGB tuned

XGB performance is best with Recall 1 for validation set. Then GBM tuned with Recall of 0.993 and Random Forest with 0.992.

Feature Importance using sklearn

The top features that are indicated in the three models are Total_Trans_Ct, Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio, Total_Revolving_Bal, Total_Amt_Chng_Q4_Q1, Contacts_count_12_mon, and Months_Inactive_12_mon.

Performance on test set

The performance on test data was also very good, the Recall is 1 for test data too.

Pipeline for productionizing the model

Business Recommendation

  1. The top features that are indicated in the three models are Total_Trans_Ct, Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio, Total_Revolving_Bal, Total_Amt_Chng_Q4_Q1, Contacts_count_12_mon, and Months_Inactive_12_mon. The business should look into the total_trans_ct, Total_Ct_Chng_Q4_Q1, Total_Revolving_Bal, Total_Amt_Chng_q4_Q1 for customers. If these are low then the customers will attrite. Customers with high numbers for these columns had low attrition.
  2. They should also look into Contacts_count_12_mon. Customers who were contacted more had less attrition. Customers who were contacted less had more attrition. The business should invest in reaching out to the customer more proactively.
  3. Months_Inactive_12_mon is also important, Customers whose accounts were inactive had higher attrition
  4. Attrition for customers with Doctorate and Post-Grduate degree is slightly more than other education categories. The business can look into customers with higher degrees and see if they can reduce attrition in that group
  5. Attrition is slightly more for Customers with Incomes greater than 120 and income less than 40K as compared to incomes between 40k-60k, or 60 -80k. The business can look into proactively contacting customers with higher incomes and check on satisfaction rate.
  6. Customers with Platinum card have the most attrition , followed by Gold and Blue. Silver has the least attrition. The business can contact Platinum card customers and ensure that they are satisfied with the service.
  7. Customers with Total_Relationship_Count of 1 or 2 have the most attrition, then its customers with total_relationship_Count of 3. Customers with total_Relationship_Count of 4, 5 or 6 have the least attrition. This shows that customers who were not contacted much had higher attrition, the business can look into customer satisfaction for this group
  8. Customers who attrited had a lesser credit limit as compared to existing customers. The business can also look into the credit limit and see if increasing the credit limit will increase customer satisfaction.
  9. Customers who attrited have a very low total_trans_ct as compared to existing customers. The business can check on customer with low total_trans_ct and proactively contact them to ensure customer satisfaction
  10. There is a small spike of attrition where the customer age is 30, then there are 2 more strong spikes at age above 65 and less than 70. The business can look into customers in the age group of 30, 65 and 70 to see if they are satisfied with the service